FAIR and scalable management of small-angle X-ray scattering data

A modular and extensible research data management toolbox based on the programming language Python and the widely used computing platform Jupyter Notebook has been established for the acquisition, visualization, analysis and storage of small-angle X-ray scattering data.


Supporting tables
Python package requirements: conda environment.yml files
The scientific computing package and environment manager conda was used in this work through Miniconda3, mainly for managing virtual environments. Only Python and Jupyter-related packages were installed with conda; all other packages and libraries were installed using pip. The "minimal" environment files contain only the explicitly installed packages with their respective versions, allowing platform-agnostic environment creation. The "macos" files contain the explicitly installed packages and all their dependencies with versions and exact builds; the builds, however, are specific to macOS/ARM64.

Mapping of PDH to AnIML
From the XML metadata footer of the PDH files, only the column and parameter elements and their children are currently mapped to AnIML, as these nodes contain the most essential information (Table S2).
Table S2. Mapping of PDH column, parameter, and value elements to AnIML Series, Unit, Category and Parameter elements.

Structure of the DaRUS metadata blocks
On DaRUS, selected fields from the Citation Metadata, Process Metadata, and Engineering Metadata blocks were used in this work (Table S3).

Output of fit parameters from Origin
The TXT-formatted output from Origin reports the parameters of the Lorentzian fits to the SAXS peaks (Fig. S1). The first column contains the intensity I as the dependent variable. The second column gives the parameter names, followed by the fitted value and its standard deviation. The t-value is the ratio of the fitted value to its standard deviation.
The Prob>|t| value is the p-value of the corresponding t-test and therefore allows the significance of each parameter to be assessed. Lastly, the dependency, which is computed from the variance-covariance matrix, further indicates the significance of each parameter. In the analysis and visualization toolkit, only the peak center values xc were used for further calculations.
Figure S1. Section from exemplary TXT file with fitting data obtained from Lorentzian fit in Origin.
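For orientation, the peak model behind these fit parameters can be sketched as follows. This is a minimal sketch, assuming the parameterization commonly documented for Origin's Lorentz peak function (offset y0, center xc, full width at half maximum w, area A). The function names `lorentzian` and `t_value` and all numbers are illustrative and are not taken from the toolkit or from Fig. S1.

```python
import math


def lorentzian(x, y0, xc, w, A):
    """Lorentzian peak with offset y0, center xc, FWHM w and area A.

    Assumed parameterization: y = y0 + (2A/pi) * w / (4(x - xc)^2 + w^2).
    """
    return y0 + (2.0 * A / math.pi) * w / (4.0 * (x - xc) ** 2 + w ** 2)


def t_value(value, standard_deviation):
    """Ratio of a fitted value to its reported standard deviation."""
    return value / standard_deviation


# Illustrative parameters only (not real fit results).
params = {"y0": 0.0, "xc": 1.25, "w": 0.05, "A": 10.0}

peak_height = lorentzian(params["xc"], **params)
# Half a FWHM away from the center the peak drops to half its height.
half_height = lorentzian(params["xc"] + params["w"] / 2.0, **params)
```

Only xc enters the subsequent calculations; the remaining parameters describe the peak shape and the fit quality.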

Indexation of the cubic LLC phase
The multiple scattering maxima of the cubic LLC phase may be assigned to various Miller indices (hkl), resulting in different possible space groups. The best fit was obtained for the body-

Guide to the Notebooks
Module 1: PDH to AnIML converter
Following the preparatory steps and the creation of an AnIML object, the PDH files available for conversion are listed from a given directory (red box).
Although further automation is possible, it would tie the workflow more closely to a particular use case and considerably increase its susceptibility to error. Therefore, the AnIML object is built one dataset at a time. In the next step, one of the files from the previously printed dict_of_files is chosen by its index.
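The file listing and selection by index can be sketched as follows. This is a minimal sketch using only the Python standard library. The helper `list_pdh_files`, the `.pdh` file extension and the directory `"."` are assumptions made for illustration; only the name dict_of_files is taken from the notebook.

```python
from pathlib import Path


def list_pdh_files(directory):
    """Return {index: path} for all .pdh files in a directory, sorted by name."""
    files = sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdh"
    )
    return dict(enumerate(files))


# Print the available files so that one of them can be chosen by its index.
dict_of_files = list_pdh_files(".")
for index, path in dict_of_files.items():
    print(index, path.name)
```

A single dataset would then be selected with, for example, `pdh_file = dict_of_files[0]`, provided that index exists.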
With the data at hand, the elements of the AnIML object are built from the bottom up. First, the experiment and sample are labelled with a name (experiment_name) and an ID (Sample_id). These values can be assigned to the respective variables either as a literal string or derived from a file name (here pdh_file). The latter, however, requires consistent naming of all measurement files.
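Deriving the labels from a file name could look like the sketch below. It assumes a hypothetical naming convention `<experiment>_<sample>.pdh`; the actual convention used for the measurement files is not stated here, and the helper `names_from_file` is illustrative.

```python
from pathlib import Path


def names_from_file(pdh_file, separator="_"):
    """Split a file name into (experiment_name, sample_id).

    Assumes the hypothetical convention '<experiment><separator><sample>.pdh',
    where the sample ID is the part after the last separator.
    """
    stem = Path(pdh_file).stem
    experiment_name, found, sample_id = stem.rpartition(separator)
    if not found or not experiment_name or not sample_id:
        raise ValueError(f"File name {stem!r} does not follow the convention")
    return experiment_name, sample_id


experiment_name, sample_id = names_from_file("data/cubic_phase_S01.pdh")
```

Raising an error for non-conforming names makes inconsistent file naming visible early instead of producing mislabelled entries.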
Next, the experiment step object is created by assigning a name and an ID, similarly to the previous step (Case a). Alternatively, an existing experiment step within an AnIML object can be chosen (Case b). Additionally, a sample reference is added to the experiment_step, providing the sample object, its role and its purpose (Step 6). Step 7 offers the opportunity to add author, device and software information to the AnIML object as instrument_parameters.
In the next step, actual measurement data is added as a series for every dimension. For that purpose, a Category is created or an existing one is accessed. The units of the columns are extracted from the metadata of the measurement files and the actual values are stored in IndividualValueSets. Another SeriesSet is created holding the measurement data and associated information which is added to the Category. The Category, in turn, is then added to the experiment_step. The experiment step which now contains all the information of one measurement is finally added to the AnIML object.
In a last step, an XML-formatted string is created from the AnIML object and serialized to the given AnIML document.
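The bottom-up construction and the final serialization can be illustrated with the standard library alone. The sketch below does not use the toolbox's own AnIML classes; it builds a strongly simplified, AnIML-like tree with `xml.etree.ElementTree`. Element and attribute names follow the AnIML core vocabulary as far as the author of this sketch is aware, but the output is not validated against the schema, and all names and values are illustrative.

```python
import xml.etree.ElementTree as ET


def build_series(name, series_id, dependency, unit_label, values):
    """Create one Series holding individual values and a unit."""
    series = ET.Element("Series", name=name, seriesID=series_id,
                        dependency=dependency, seriesType="Float64")
    value_set = ET.SubElement(series, "IndividualValueSet")
    for value in values:
        ET.SubElement(value_set, "D").text = repr(float(value))
    ET.SubElement(series, "Unit", label=unit_label)
    return series


def build_document(sample_id, step_name, q_values, intensities):
    """Assemble sample, experiment step, result and series into one tree."""
    root = ET.Element("AnIML")
    sample_set = ET.SubElement(root, "SampleSet")
    ET.SubElement(sample_set, "Sample", name=sample_id, sampleID=sample_id)
    step_set = ET.SubElement(root, "ExperimentStepSet")
    step = ET.SubElement(step_set, "ExperimentStep",
                         name=step_name, experimentStepID=step_name)
    result = ET.SubElement(step, "Result", name="SAXS measurement")
    series_set = ET.SubElement(result, "SeriesSet", name="Scattering curve",
                               length=str(len(q_values)))
    series_set.append(build_series("q", "q", "independent", "1/nm", q_values))
    series_set.append(build_series("I", "I", "dependent",
                                   "counts per area", intensities))
    return root


root = build_document("S01", "saxs_S01", [0.5, 1.0, 1.5], [1200.0, 340.0, 95.0])
# Serialize the tree to an XML-formatted string.
xml_string = ET.tostring(root, encoding="unicode")
```

Writing `xml_string` to a file, or calling `ET.ElementTree(root).write(...)`, would correspond to the serialization into the AnIML document described above.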
To add further datasets to the AnIML document, an existing document is opened (Step 1, Case b) and the subsequent steps are carried out as before. The finished AnIML document now contains all information needed to recreate a similar experiment as well as the raw data of the measurements. Exemplary excerpts are shown in the following.
At the beginning of the AnIML document, an overview of all contained sample data is given.
For each dataset, the instrument information is given, followed by the result, which contains two series holding the scattering vector (in nm⁻¹) and the corresponding intensity (in counts per area).

Submodule 2.1: Lorentzian fit with Origin
Lorentzian fits of the measured peaks are carried out using external software (Origin). For this purpose, the data stored in the AnIML document are converted to a TSV file.
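The export to TSV can be sketched with the `csv` module. This is a minimal sketch that assumes the two series have already been read from the AnIML document into plain lists; the helper `series_to_tsv`, the column headers and the values are illustrative.

```python
import csv
import io


def series_to_tsv(q_values, intensities, stream, header=("q", "I")):
    """Write two equally long series as tab-separated columns to a text stream."""
    if len(q_values) != len(intensities):
        raise ValueError("Both series must have the same length")
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(zip(q_values, intensities))


# An in-memory buffer stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
series_to_tsv([0.5, 1.0], [1200.0, 340.0], buffer)
tsv_text = buffer.getvalue()
```

The resulting file can then be imported into Origin or any other fitting software that reads tab-separated text.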

Submodule 2.2: Analysis
Several steps are necessary to determine the lyotropic liquid crystalline (LLC) phase and the corresponding lattice parameter a. To be able to add the analysis data to the AnIML document afterwards, the respective experiment_step must be accessed and a new Category, which will hold the analyses, must be added to it. The next step involves the import of the Lorentzian fit data obtained from Origin (or any other analysis software). The available files are stored in a data frame, from which a file can later be chosen by its index. The calculated and corrected q-values of the scattering maxima are subsequently added to the AnIML document within a Category.
Next, the lattice plane distance is calculated (in nm) from the corrected q-values and added to the AnIML document in another Category. The lattice plane ratio d is likewise calculated and added to the AnIML document.
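The two calculations can be sketched as follows, using the standard relation d = 2π/q between the scattering vector and the lattice plane distance. The definition of the ratio as the first distance divided by each subsequent distance is an assumption of this sketch, as are the function names and the example values.

```python
import math


def lattice_plane_distances(q_values):
    """Lattice plane distances d = 2*pi/q (in nm for q given in 1/nm)."""
    return [2.0 * math.pi / q for q in q_values]


def lattice_plane_ratios(d_values):
    """Ratios d_1/d_i relative to the first scattering maximum (assumed definition)."""
    return [d_values[0] / d for d in d_values]


# Illustrative peak positions in 1/nm (not measured data).
q_peaks = [1.0, 2.0, 3.0]
d_values = lattice_plane_distances(q_peaks)
ratios = lattice_plane_ratios(d_values)
```

With this definition the ratios equal q_i/q_1, so they can be compared directly with the characteristic peak position ratios of the different phases.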
With this information, the LLC phase can now be determined. Because certain phases exhibit characteristic lattice plane ratios, the calculated ratios are checked against the corresponding conditions. If a phase is identified, the corresponding lattice constant a is calculated accordingly. If the phase is indeterminate, further analysis by visualization can be carried out (see Submodule 3.2: Diffractograms).
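Such a check could be sketched as below. The reference ratios are the textbook peak position ratios for lamellar (1 : 2 : 3) and hexagonal (1 : √3 : 2 : √7) phases; the conditions, tolerance and phases actually implemented in the notebook may differ, and all names here are illustrative.

```python
import math

# Textbook ratios q_i/q_1 (equivalently d_1/d_i); assumed, not taken from the notebook.
REFERENCE_RATIOS = {
    "lamellar": [1.0, 2.0, 3.0],
    "hexagonal": [1.0, math.sqrt(3.0), 2.0, math.sqrt(7.0)],
}


def determine_phase(ratios, tolerance=0.03):
    """Return the first phase whose reference ratios match within a relative tolerance."""
    for phase, reference in REFERENCE_RATIOS.items():
        if min(len(ratios), len(reference)) < 2:
            continue  # at least two maxima are needed for a comparison
        if all(abs(r - ref) <= tolerance * ref for r, ref in zip(ratios, reference)):
            return phase
    return "indeterminate"


def lattice_constant(phase, d_first):
    """Lattice constant a from the first lattice plane distance, if the phase is known."""
    if phase == "lamellar":
        return d_first
    if phase == "hexagonal":
        return 2.0 * d_first / math.sqrt(3.0)
    return None


phase = determine_phase([1.0, 1.73, 2.01])
a = lattice_constant(phase, d_first=6.0)
```

Cubic phases are deliberately left as "indeterminate" in this sketch, since their assignment depends on the chosen Miller indices.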
If a cubic phase is inferred from the diffractograms (or the above script), the space group can be specified by plotting the measured reciprocal $d^{-1}$ values against $\sqrt{h^2 + k^2 + l^2}$. The closer the R² value of the resulting plot is to unity, the more likely the assigned Miller indices and the corresponding space group are correct. Another section of this notebook allows the space group of the cubic phase to be determined by creating such a plot for different input Miller indices. The results obtained can afterwards be added to the AnIML file.
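The indexing test can be sketched as an ordinary least-squares line through the points ($\sqrt{h^2 + k^2 + l^2}$, $d^{-1}$). For a cubic lattice $d^{-1} = \sqrt{h^2 + k^2 + l^2}/a$, so the slope of the line is 1/a. The function name, the example Miller indices and the synthetic data below are illustrative assumptions and do not reproduce the assignment made in this work.

```python
import math


def index_cubic(d_values, miller_indices):
    """Fit 1/d against sqrt(h^2 + k^2 + l^2); return (lattice constant a, R^2)."""
    x = [math.sqrt(h * h + k * k + l * l) for h, k, l in miller_indices]
    y = [1.0 / d for d in d_values]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot
    return 1.0 / slope, r_squared


# Synthetic distances for a = 10 nm and three illustrative reflections.
indices = [(1, 1, 0), (2, 0, 0), (2, 1, 1)]
d_synthetic = [10.0 / math.sqrt(h * h + k * k + l * l) for h, k, l in indices]
a_fit, r_squared = index_cubic(d_synthetic, indices)
```

Repeating the fit with different candidate sets of Miller indices and comparing the R² values mirrors the comparison described above.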

Submodule 3.2: Diffractograms
In the first part of this Notebook, two-parameter plots can be created from the data in the AnIML document. To this end, an AnIML document is chosen by its directory, and the measurement data of one or more samples are selected through their IDs in files_to_plot.
With the data from files_to_plot, two-parameter plots are created for each dataset.